Model Selection in Data Analysis Competitions
Abstract. The use of data analysis competitions for selecting the most appropriate model for a problem is a recent innovation in the field of predictive machine learning. Two of the best-known examples of this trend are the Netflix Competition and, more recently, the competitions hosted on the online platform Kaggle. In this paper, we state and try to verify a set of qualitative hypotheses about predictive modelling, both in general and in the scope of data analysis competitions. To verify our hypotheses we look at previous competitions and their outcomes, conduct qualitative interviews with top performers from Kaggle, and draw on our own experience of competing in Kaggle competitions. The stated hypotheses about feature engineering, ensembling, overfitting, model complexity and evaluation metrics give indications and guidelines on how to select a proper model for performing well in a competition on Kaggle.
Inferring Person-to-person Proximity Using WiFi Signals
Today's societies are enveloped in an ever-growing telecommunication
infrastructure. This infrastructure offers important opportunities for sensing
and recording a multitude of human behaviors. Human mobility patterns are a
prominent example of such a behavior which has been studied based on cell phone
towers, Bluetooth beacons, and WiFi networks as proxies for location. However,
while mobility is an important aspect of human behavior, understanding complex
social systems requires studying not only the movement of individuals, but also
their interactions. Sensing social interactions on a large scale is a technical
challenge and many commonly used approaches---including RFID badges or
Bluetooth scanning---offer only limited scalability. Here we show that it is
possible, in a scalable and robust way, to accurately infer person-to-person
physical proximity from the lists of WiFi access points measured by smartphones
carried by the two individuals. Based on a longitudinal dataset of
approximately 800 participants with ground-truth interactions collected over a
year, we show that our model performs better than the current state-of-the-art.
Our results demonstrate the value of WiFi signals in social sensing as well as
potential threats to privacy that they imply.
On the number of spanning trees in random regular graphs
Let $d$ be a fixed integer. We give an asymptotic formula for the
expected number of spanning trees in a uniformly random $d$-regular graph with
$n$ vertices. (The asymptotics are as $n \to \infty$, restricted to even $n$ if
$d$ is odd.) We also obtain the asymptotic distribution of the number of
spanning trees in a uniformly random cubic graph, and conjecture that the
corresponding result holds for arbitrary (fixed) $d$. Numerical evidence is
presented which supports our conjecture.
Comment: 26 pages, 1 figure. To appear in the Electronic Journal of
Combinatorics. This version addresses referee's comment
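For readers who want to experiment with the quantity studied in this abstract, the number of spanning trees of one concrete graph can be computed exactly with Kirchhoff's matrix-tree theorem. The sketch below is only an illustration and is not taken from the paper: the function name `count_spanning_trees` and the edge-list input format are assumptions made here, and the paper's asymptotic formula is not reproduced.

```python
from fractions import Fraction


def count_spanning_trees(n, edges):
    """Count spanning trees of a graph on vertices 0..n-1 (hypothetical helper).

    Uses Kirchhoff's matrix-tree theorem: the count equals the determinant
    of the graph Laplacian with one row and the matching column removed.
    """
    # Build the Laplacian L = D - A with exact rational entries.
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1

    # Delete the last row and column, then take the determinant
    # by Gaussian elimination over the rationals.
    m = [row[:-1] for row in lap[:-1]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0  # singular minor: the graph is disconnected
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)


# Two small cubic (3-regular) graphs: K4 and K3,3.
k4_edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
k33_edges = [(i, 3 + j) for i in range(3) for j in range(3)]
print(count_spanning_trees(4, k4_edges))   # 16
print(count_spanning_trees(6, k33_edges))  # 81
```

Exact rational arithmetic is used so that the determinant is an integer without rounding error; for large graphs a floating-point or modular determinant would be the more practical choice.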
Inferring Stop-Locations from WiFi
Human mobility patterns are inherently complex. In terms of understanding these patterns, the process of converting raw data into series of stop-locations and transitions is an important first step which greatly reduces the volume of data, thus simplifying the subsequent analyses. Previous research into the mobility of individuals has focused on inferring 'stop locations' (places of stationarity) from GPS or CDR data, or on detection of state (static/active). In this paper we bridge the gap between the two approaches: we introduce methods for detecting both mobility state and stop-locations. In addition, our methods are based exclusively on WiFi data. We study two months of WiFi data collected every two minutes by a smartphone, and infer stop-locations in the form of labelled time-intervals. For this purpose, we investigate two algorithms, both of which scale to large datasets: a greedy approach that selects the most important routers, and one that uses a density-based clustering algorithm to detect router fingerprints. We validate our results using participants' GPS data as well as ground truth data collected during a two-month period.
String Matching with Variable Length Gaps
We consider string matching with variable length gaps. Given a string $T$ and
a pattern $P$ consisting of strings separated by variable length gaps
(arbitrary strings of length in a specified range), the problem is to find all
ending positions of substrings in $T$ that match $P$. This problem is a basic
primitive in computational biology applications. Let $m$ and $n$ be the lengths
of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We
present a new algorithm achieving time $O(n \log k + m + \alpha)$ and space
$O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in
$P$ and $\alpha$ is the total number of occurrences of the strings in $P$
within $T$. Compared to the previous results this bound essentially achieves
the best known time and space complexities simultaneously. Consequently, our
algorithm obtains the best known bounds for almost all combinations of $m$,
$n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and
straightforward to implement. We also present algorithms for finding and
encoding the positions of all strings in $P$ for every match of the pattern.
Comment: draft of full version, extended abstract at SPIRE 201
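To make the problem statement in this abstract concrete, here is a minimal baseline sketch in Python. It is not the algorithm of the paper and does not attain its bounds; it simply propagates the set of feasible end positions from one pattern string to the next. The names `find_occurrences` and `match_with_gaps`, the representation of the pattern as a list of strings plus a list of `(lo, hi)` gap ranges, and the use of exclusive end indices are all assumptions made for this illustration.

```python
from itertools import accumulate


def find_occurrences(text, word):
    """Return every start index at which `word` occurs in `text` (overlaps allowed)."""
    starts = []
    i = text.find(word)
    while i != -1:
        starts.append(i)
        i = text.find(word, i + 1)
    return starts


def match_with_gaps(text, words, gaps):
    """Return the sorted end positions (exclusive indices) of matches.

    words: the non-empty strings of the pattern, in order.
    gaps:  one (lo, hi) pair per consecutive pair of words, giving the
           allowed number of arbitrary characters between them.
    """
    if len(gaps) != len(words) - 1:
        raise ValueError("need exactly one gap between consecutive words")
    n = len(text)

    # reachable[e] is True if the pattern prefix handled so far can end at e.
    reachable = [False] * (n + 1)
    for s in find_occurrences(text, words[0]):
        reachable[s + len(words[0])] = True

    for word, (lo, hi) in zip(words[1:], gaps):
        # prefix[i] = number of reachable end positions strictly below i.
        prefix = [0] + list(accumulate(int(r) for r in reachable))
        nxt = [False] * (n + 1)
        for s in find_occurrences(text, word):
            # A previous end e is compatible when lo <= s - e <= hi.
            left, right = max(0, s - hi), s - lo
            if right >= left and prefix[right + 1] - prefix[left] > 0:
                nxt[s + len(word)] = True
        reachable = nxt

    return [e for e, ok in enumerate(reachable) if ok]


print(match_with_gaps("abxxcdabcd", ["ab", "cd"], [(1, 2)]))  # [6]
print(match_with_gaps("abxxcdabcd", ["ab", "cd"], [(0, 2)]))  # [6, 10]
```

In the first call the gap must contain one or two characters, so only "ab" + "xx" + "cd" matches; allowing an empty gap in the second call also admits the adjacent "abcd" at the end of the text.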
Optimisation of Car Park Designs
The problem presented by ARUP to the UK Study Group 2014 was to investigate methods for maximising the number of car parking spaces that can be placed within a car park. This is particularly important for basement car parks in residential apartment blocks or offices where parking spaces command a high value. Currently the job of allocating spaces is done manually and is very time intensive.
The Study Group working on this problem split into teams examining different aspects of the car park design process. Three approaches were taken. The first is a so-called "tile-and-trim" method in which an optimal layout of cars from an `infinite car park' is overlaid onto the actual car park domain; adjustments are then made to accommodate access from one lane to the next. A second approach seeks to develop an algorithm for optimising the road within a car park on the assumption that car parking spaces should fill the space and that any space needs to be adjacent to the network. A third, similar approach focused on schemes for assessing the potential capacity of a small selection of specified road networks within the car park to assist the architect in selecting the optimal road network(s).
The problem is a variant of the "bin packing" problem, well known in computer science. It is further complicated by the fact that two different classes of item need to be packed (roads and cars), with both local (immediate access to a road) and global (connectivity of the road network) constraints. Bin-packing is known to be NP-hard, and hence the problem at hand has at least this level of computational complexity.
None of the approaches produced a complete solution to the problem posed. Indeed, it was quickly determined by the group that this was a very hard problem (a view reinforced by the many different possible approaches considered) requiring far longer than a week to really make significant progress. All approaches rely to differing degrees on optimisation algorithms which are inherently unreliable unless designed specifically for the intended purpose. It is also not clear whether a relatively simple automated computer algorithm will be able to "beat the eye of the architect"; additional sophistication may be required due to subtle constraints.
Apart from determining that the problem is hard, positive outcomes have included:
Determining that parking perpendicular to the road in long aisles provides the most efficient packing of cars.
Provision of code which "tiles and trims" from an infinite car park onto the given car park with interactive feedback on the number of cars in the packing.
Provision of code for optimal packing in a parallel-walled car park.
Methods for optimising a road within a given domain based on developing cost functions ensuring that cars fill the car park and have access to the road. Provision of code for optimising a single road in a given (square) space.
Description of methods for assessing the capacity of a car park for a given set of road networks, in order to select optimal road networks.
Some ideas for developing possible solutions further.
RUNTIME DICTIONARIES FOR ROOT
ROOT is the LHC physicists' common tool for data analysis; almost all data is stored using ROOT's I/O system. This system benefits from a custom description of types (a so-called dictionary) that is optimised for the I/O. Until now, the dictionary could not be provided at run-time; it had to be prepared in a separate prerequisite step. This project will move the generation of the dictionary to run-time, making use of ROOT 6's new just-in-time compiler. This allows more dynamic and natural access to ROOT's I/O features, especially for user code.